Approximate Hierarchical Clustering via Sparsest Cut and Spreading Metrics

Authors

  • Moses Charikar
  • Vaggos Chatziafratis
Abstract

Dasgupta recently introduced a cost function for the hierarchical clustering of a set of points given pairwise similarities between them. He showed that this function is NP-hard to optimize, but that a top-down recursive partitioning heuristic based on an α_n-approximation algorithm for uniform sparsest cut gives an O(α_n log n)-approximation (the current best algorithm has α_n = O(√(log n))). We show that the aforementioned sparsest cut heuristic in fact obtains an O(α_n)-approximation. The algorithm also applies to a generalized cost function studied by Dasgupta. Moreover, we obtain a strong inapproximability result, showing that the Hierarchical Clustering objective is hard to approximate to within any constant factor assuming the Small-Set Expansion (SSE) Hypothesis. Finally, we discuss approximation algorithms based on convex relaxations. We present a spreading metric SDP relaxation for the problem and show that it has integrality gap at most O(√(log n)). The advantage of the SDP relative to the sparsest cut heuristic is that it provides an explicit lower bound on the optimal solution and could potentially yield an even better approximation for hierarchical clustering. In fact, our analysis of this SDP served as the inspiration for our improved analysis of the sparsest cut heuristic. We also show that a spreading metric LP relaxation gives an O(log n)-approximation.

∗ Computer Science Department, Stanford University, [email protected], [email protected]
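To make the objective and the heuristic concrete, the following is a minimal Python sketch (not code from the paper): it computes Dasgupta's cost function for a binary hierarchy and runs the top-down recursive partitioning heuristic the abstract analyzes. The names dasgupta_cost, uniform_sparsest_cut, and recursive_partition are our own, and the exhaustive sparsest-cut routine is only a stand-in for the α_n-approximate uniform sparsest cut algorithm (e.g. the O(√(log n))-approximation) that the stated guarantees assume.

from itertools import combinations

def dasgupta_cost(tree, weights):
    # cost(T) = sum over similarity edges {i, j} of w_ij * |leaves(T[i v j])|,
    # where T[i v j] is the subtree rooted at the lowest common ancestor of i, j.
    # `tree` is a nested 2-tuple (left, right); leaves are hashable labels.
    def leaves(t):
        if isinstance(t, tuple):
            return leaves(t[0]) | leaves(t[1])
        return {t}

    if not isinstance(tree, tuple):
        return 0.0
    left, right = leaves(tree[0]), leaves(tree[1])
    crossing = sum(w for (i, j), w in weights.items()
                   if (i in left and j in right) or (i in right and j in left))
    return (crossing * (len(left) + len(right))
            + dasgupta_cost(tree[0], weights)
            + dasgupta_cost(tree[1], weights))

def uniform_sparsest_cut(vertices, weights):
    # Exhaustive uniform sparsest cut: minimize w(S, V \ S) / (|S| * |V \ S|).
    # Stand-in for the alpha_n-approximation the analysis assumes; exponential,
    # so only usable on tiny instances.
    vertices = set(vertices)
    best_ratio, best_cut = float("inf"), None
    ordered = sorted(vertices)
    for r in range(1, len(ordered) // 2 + 1):
        for side in combinations(ordered, r):
            S = set(side)
            T = vertices - S
            cut = sum(w for (i, j), w in weights.items()
                      if (i in S and j in T) or (i in T and j in S))
            ratio = cut / (len(S) * len(T))
            if ratio < best_ratio:
                best_ratio, best_cut = ratio, (S, T)
    return best_cut

def recursive_partition(vertices, weights):
    # Top-down heuristic: split the current vertex set with an (approximate)
    # uniform sparsest cut and recurse on both sides.
    vertices = set(vertices)
    if len(vertices) == 1:
        return next(iter(vertices))
    S, T = uniform_sparsest_cut(vertices, weights)
    return (recursive_partition(S, weights), recursive_partition(T, weights))

# Tiny example: two tightly linked pairs joined by one weak similarity edge.
w = {("a", "b"): 1.0, ("c", "d"): 1.0, ("b", "c"): 0.1}
tree = recursive_partition({"a", "b", "c", "d"}, w)
print(tree)                    # (('a', 'b'), ('c', 'd'))
print(dasgupta_cost(tree, w))  # 0.1*4 + 1.0*2 + 1.0*2 = 4.4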


Similar resources

Hierarchical Clustering via Spreading Metrics

We study the cost function for hierarchical clusterings introduced by [Dasgupta, 2016] where hierarchies are treated as first-class objects rather than deriving their cost from projections into flat clusters. It was also shown in [Dasgupta, 2016] that a top-down algorithm returns a hierarchical clustering of cost at most O(α_n log n) times the cost of the optimal hierarchical clustering, where ...


A Graph-Theoretic Clustering Methodology Based on Vertex-Attack Tolerance

We consider a schema for graph-theoretic clustering of data using a node-based resilience measure called vertex attack tolerance (VAT). Resilience measures indicate worst case (critical) attack sets of edges or nodes in a network whose removal disconnects the graph into separate connected components: the resulting components form the basis for candidate clusters, and the critical sets of edges ...


The use of sparsest cuts to reveal the hierarchical community structure of social networks

We show that good community structures can be obtained by partitioning a social network in a succession of divisive sparsest cuts. A network flow algorithm based on fundamental principles of graph theory is introduced to identify the sparsest cuts and an underlying hierarchical community structure of the network via maximum concurrent flow. Matula [Matula, David W., 1985. Concurrent flow and conc...


Hierarchical Overlapping Clustering of Network Data Using Cut Metrics

A novel method to obtain hierarchical and overlapping clusters from network data – i.e., a set of nodes endowed with pairwise dissimilarities – is presented. The introduced method is hierarchical in the sense that it outputs a nested collection of groupings of the node set depending on the resolution or degree of similarity desired, and it is overlapping since it allows nodes to belong to more ...


Unifying Sparsest Cut, Cluster Deletion, and Modularity Clustering Objectives with Correlation Clustering

We present and analyze a new framework for graph clustering based on a specially weighted version of correlation clustering that unifies several existing objectives and satisfies a number of attractive theoretical properties. Our framework, which we call LambdaCC, relies on a single resolution parameter λ, which implicitly controls both the edge density and sparsest cut of all output clusters....



Publication date: 2017